Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you should use R Markdown syntax.
# This is a so-called "R chunk" where you can write R code.
date()
## [1] "Thu Nov 19 11:22:05 2020"
"I am Ambrin from the University of Eastern Finland"
## [1] "I am Ambrin from the University of Eastern Finland"
"I am based in Kuopio."
## [1] "I am based in Kuopio."
"I am doing my PhD in Health Sciences"
## [1] "I am doing my PhD in Health Sciences"
"I am planning to learn basics of R and concepts about open data science here"
## [1] "I am planning to learn basics of R and concepts about open data science here"
"I am curious"
## [1] "I am curious"
"I heard about this course from the yammer platform of my University"
## [1] "I heard about this course from the yammer platform of my University"
"Its great to learn new things."
## [1] "Its great to learn new things."
"The link to my github repository is https://github.com/ambrinbabu/IODS-project/"
## [1] "The link to my github repository is https://github.com/ambrinbabu/IODS-project/"
The text continues here.
Ambrin@LAPTOP-P8RFDGP7:/home/Ambrin$ git config –global Ambrin.email ambrin.babu@uef.fi
Describe the work you have done this week and summarize your learning.
date()
## [1] "Thu Nov 19 11:22:05 2020"
We would like to read the students2014 data into R from the url provided and would like to explore the data.
#Read the students2014 data into R from url and store in students2014
students2014 <- read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/learning2014.txt",sep=",", header=TRUE )
head(students2014)
## gender age attitude deep stra surf points
## 1 F 53 3.7 3.583333 3.375 2.583333 25
## 2 M 55 3.1 2.916667 2.750 3.166667 12
## 3 F 49 2.5 3.500000 3.625 2.250000 24
## 4 M 53 3.5 3.500000 3.125 2.250000 10
## 5 M 49 3.7 3.666667 3.625 2.833333 22
## 6 F 38 3.8 4.750000 3.625 2.416667 21
#Explore the structure of the data
str(students2014)
## 'data.frame': 166 obs. of 7 variables:
## $ gender : chr "F" "M" "F" "M" ...
## $ age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ attitude: num 3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ points : int 25 12 24 10 22 21 21 31 24 26 ...
The data is in the form of a data frame and as we can see, there are 7 columns of data representing gender, age, attitude, deep, stra, surf and points. There are 166 observations (rows). The data comes from ASSIST (The approaches and study skills inventory for students) and includes information about 116 students of different age groups and comprise both male and female and approaches they use for learning - surface approach (memorise without understanding with a serious lack of personal engagement in learning process), deep approach (intention to maximise understanding with a true commitment to learning) and strategic approach (apply any strategy that maximises the chance of chieving highest possible grades). The student achievments are measured by points in the exams
#Explore the dimensions of the data
dim(students2014)
## [1] 166 7
#ggplot2 is a popular library for creating stunning graphics with R.
#install.packages("ggplot2")
library(ggplot2)
#Show a graphical overview of the data
p1 <- ggplot(students2014, aes(x = attitude, y = points, col = gender))
# define the visualization type (points)
p2 <- p1 + geom_point()
# add a regression line
p3 <- p2 + geom_smooth(method = "lm")
# add a main title and draw the plot
p4 <- p3+ ggtitle("Student's attitude versus exam points")
p4
## `geom_smooth()` using formula 'y ~ x'
Here we see a graphical overview of the data and see the relationship between the attitude vs exam points
#Show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them
##Simple regression
# a scatter plot of points versus attitude
library(ggplot2)
qplot(attitude, points, data = students2014) + geom_smooth(method = "lm")
## `geom_smooth()` using formula 'y ~ x'
# fit a linear model
my_model <- lm(points ~ attitude, data = students2014)
# print out a summary of the model
summary(my_model)
##
## Call:
## lm(formula = points ~ attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
We see here a linear model fitted for our data.Linear regression model is a simple statistical model.It is an approach for modelling the relationship between a dependent variable y and one or more exploratory variables x.The model is found by minimising the sum of the residuals.Residuals are essentially the difference between the actual observed response values and the response values that the model predicted.
The exam points are the target variable and attitude is the explanatory variable. The variables are evenly distributed. The summary of the variables in the data is also shown.The Residuals section of the model output breaks it down into 5 summary points - min, 1Q, median, 2Q and max. The coefficients gives estimates for the parameters of the model.In this case the p value for attitude is very low. So there is a statistical relationship between attitude and points.
#Choose three variables as explanatory variables and fit a regression model where exam points is the target (dependent) variable. Show a summary of the fitted model and comment and interpret the results. Explain and interpret the statistical test related to the model parameters. If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it.
my_model2 <- lm(points ~ attitude + stra + surf, data = students2014)
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude + stra + surf, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.1550 -3.4346 0.5156 3.6401 10.8952
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.0171 3.6837 2.991 0.00322 **
## attitude 3.3952 0.5741 5.913 1.93e-08 ***
## stra 0.8531 0.5416 1.575 0.11716
## surf -0.5861 0.8014 -0.731 0.46563
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared: 0.2074, Adjusted R-squared: 0.1927
## F-statistic: 14.13 on 3 and 162 DF, p-value: 3.156e-08
Here 3 variables (attitude, strategic learning, surface learning) are explanatory variables and a multiple regression model is fitted where points is the target variable. There are 5 point summary of residuals of model - min, 1Q, median, 3Q and Max.
Below the residuals section, we see coefficients which gives estimates for the parameters of the model. The estimate corresponding to the intercept is the estimate of alpha parameter and the estimate corresponding to attitude is beta parameter. Here we estimated the effect of attitude on points to be 11.01 with standard difference of approx 3.68. We also have t and p values corresponding to statistics test of null hypothesis that the actual value of beta parameter would be 0. In this case the p value for attitude is very low. So there is a statistical relationship between attitude and points.However, for stra and surf, these values are not statistically significant and so there is no statistical relationship between stra (or surf) and points.
Residual Standard Error is the measure of the quality of a linear regression fit. Theoretically, every linear model is assumed to contain an error term E. Due to the presence of this error term, we are not capable of perfectly predicting our explanatory variable from the target variable. The Residual Standard Error is the average amount that the response will deviate from the true regression line. It’s also worth noting that the Residual Standard Error was calculated with 162 degrees of freedom. Simplistically, degrees of freedom are the number of data points that went into the estimation of the parameters used after taking into account these parameters.
The R-squared (R2) statistic provides a measure of how well the model is fitting the actual data. It takes the form of a proportion of variance. R2 is a measure of the linear relationship between our target variable and our explanatory variable. It always lies between 0 and 1 (i.e.: a number near 0 represents a regression that does not explain the variance in the response variable well and a number close to 1 does explain the observed variance in the response variable). In our example, the R2 we get is 0.2074. Or roughly 20% of the variance found in the explanatory variables can be explained by the target variable.
#If an explanatory variable in your model does not have a statistically significant relationship with the target variable, remove the variable from the model and fit the model again without it.
my_model2 <- lm(points ~ attitude, data = students2014)
summary(my_model2)
##
## Call:
## lm(formula = points ~ attitude, data = students2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.6372 1.8303 6.358 1.95e-09 ***
## attitude 3.5255 0.5674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
In the previous example, we saw that 2 of the explanatory variables (strategic and surface learning) did not have a statistically significant relationship with the target variable (points). Hence we removed the 2 variables and fitted the model again without it.
#Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage
# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra, data = students2014)
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
#1 residuals vs fitted values
#2 Normal QQplot
#5 Residuals vs levarage
par(mfrow = c(2,2))
plot(my_model2, which = 2)
Statistical models always include several assumption which describe the data generating process. In a linear regression model, we assume linearity. The target variable is modelled as a linear combination of the model parameters. Usually it is assumed that the errors are normally distributed, not correlated and have constant variance. Further, its also assumed that the size of a given error does not depend on the values of the explanatory variables.
QQ-plot: QQ plot of the residuals provides a meathod to explore the assumption that the errors of the model are normally distributed. The better the points fall within the line, the better is the fit to the normality assumption. In our case, we see a reasonable fit.
#Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage
# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra, data = students2014)
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
#1 residuals vs fitted values
#2 Normal QQplot
#5 Residuals vs levarage
par(mfrow = c(2,2))
plot(my_model2, which = 1)
The constant variance assumption implies that the size of the errors should not depend on the explanatory variables.This can be explored by plotting a scatter plot of residuals versus model predictors. In our case, we dont see when fitted values increase, spread of residuals increase, indicating a problem.
#Produce the following diagnostic plots: Residuals vs Fitted values, Normal QQ-plot and Residuals vs Leverage
# create a regression model with multiple explanatory variables
my_model2 <- lm(points ~ attitude + stra, data = students2014)
# draw diagnostic plots using the plot() function. Choose the plots 1, 2 and 5
#1 residuals vs fitted values
#2 Normal QQplot
#5 Residuals vs levarage
par(mfrow = c(2,2))
plot(my_model2, which = 5)
Leverage of observations measures how much impact a single observation has on the model.Residuals vs leverage plot can help identify which observations have an unusually high impact.We do not have one particular point with very high leverage so we can conclude that it is a regular leverage without any outliers.
(more chapters to be added similarly as we proceed with the course!)
Describe the work you have done this week and summarize your learning.
date()
## [1] "Thu Nov 19 11:22:09 2020"
# read data
alc <- read.csv("https://github.com/rsund/IODS-project/raw/master/data/alc.csv")
# explore structure and dimension
str(alc)
## 'data.frame': 370 obs. of 51 variables:
## $ school : chr "GP" "GP" "GP" "GP" ...
## $ sex : chr "F" "F" "F" "F" ...
## $ age : int 15 15 15 15 15 15 15 15 15 15 ...
## $ address : chr "R" "R" "R" "R" ...
## $ famsize : chr "GT3" "GT3" "GT3" "GT3" ...
## $ Pstatus : chr "T" "T" "T" "T" ...
## $ Medu : int 1 1 2 2 3 3 3 2 3 3 ...
## $ Fedu : int 1 1 2 4 3 4 4 2 1 3 ...
## $ Mjob : chr "at_home" "other" "at_home" "services" ...
## $ Fjob : chr "other" "other" "other" "health" ...
## $ reason : chr "home" "reputation" "reputation" "course" ...
## $ guardian : chr "mother" "mother" "mother" "mother" ...
## $ traveltime: int 2 1 1 1 2 1 2 2 2 1 ...
## $ studytime : int 4 2 1 3 3 3 3 2 4 4 ...
## $ schoolsup : chr "yes" "yes" "yes" "yes" ...
## $ famsup : chr "yes" "yes" "yes" "yes" ...
## $ activities: chr "yes" "no" "yes" "yes" ...
## $ nursery : chr "yes" "no" "yes" "yes" ...
## $ higher : chr "yes" "yes" "yes" "yes" ...
## $ internet : chr "yes" "yes" "no" "yes" ...
## $ romantic : chr "no" "yes" "no" "no" ...
## $ famrel : int 3 3 4 4 4 4 4 4 4 4 ...
## $ freetime : int 1 3 3 3 2 3 2 1 4 3 ...
## $ goout : int 2 4 1 2 1 2 2 3 2 3 ...
## $ Dalc : int 1 2 1 1 2 1 2 1 2 1 ...
## $ Walc : int 1 4 1 1 3 1 2 3 3 1 ...
## $ health : int 1 5 2 5 3 5 5 4 3 4 ...
## $ n : int 2 2 2 2 2 2 2 2 2 2 ...
## $ id.p : int 1096 1073 1040 1025 1166 1039 1131 1069 1070 1106 ...
## $ id.m : int 2096 2073 2040 2025 2153 2039 2131 2069 2070 2106 ...
## $ failures : int 0 1 0 0 1 0 1 0 0 0 ...
## $ paid : chr "yes" "no" "no" "no" ...
## $ absences : int 3 2 8 2 5 2 0 1 9 10 ...
## $ G1 : int 10 10 14 10 12 12 11 10 16 10 ...
## $ G2 : int 12 8 13 10 12 12 6 10 16 10 ...
## $ G3 : int 12 8 12 9 12 12 6 10 16 10 ...
## $ failures.p: int 0 0 0 0 0 0 0 0 0 0 ...
## $ paid.p : chr "yes" "no" "no" "no" ...
## $ absences.p: int 4 2 8 2 2 2 0 0 6 10 ...
## $ G1.p : int 13 13 14 10 13 11 10 11 15 10 ...
## $ G2.p : int 13 11 13 11 13 12 11 10 15 10 ...
## $ G3.p : int 13 11 12 10 13 12 12 11 15 10 ...
## $ failures.m: int 1 2 0 0 2 0 2 0 0 0 ...
## $ paid.m : chr "yes" "no" "yes" "yes" ...
## $ absences.m: int 2 2 8 2 8 2 0 2 12 10 ...
## $ G1.m : int 7 8 14 10 10 12 12 8 16 10 ...
## $ G2.m : int 10 6 13 9 10 12 0 9 16 11 ...
## $ G3.m : int 10 5 13 8 10 11 0 8 16 11 ...
## $ alc_use : num 1 3 1 1 2.5 1 2 2 2.5 1 ...
## $ high_use : logi FALSE TRUE FALSE FALSE TRUE FALSE ...
## $ cid : int 3001 3002 3003 3004 3005 3006 3007 3008 3009 3010 ...
dim(alc)
## [1] 370 51
It is a dataframe with 370 observation and 51 variables. The data is about the secondary school in 2 portugese schools and their achievements. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por).The data contains information also about the different students, their age, sex etc. and if they consume high levels of alcohol or not. The alcohol consumption scale ranges from 1 (very low) to 5 (very high).
Hypothesis: My hypothesis is that many reasons affect the alcohol consumption.I have chosen 4 interesting variables - gender, famrel, goout (going out with friends) and freetime. Gender- I hypothesise that male drink more than women. famrel - When there are more family relations, less alcohol is consumed goout - when someone goes out with friends, more alcohol is consumed freetime - when there is more freetime, more alcohol is consumed
#Numerically and graphically explore the distributions of your chosen variables and their relationships with alcohol consumption (use for example cross-tabulations, bar plots and box plots). Comment on your findings and compare the results of your exploration to your previously stated hypotheses.
library(tidyr); library(dplyr); library(ggplot2)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
gather(alc) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar()
alc %>% group_by(sex, high_use) %>% summarise(count = n())
## `summarise()` regrouping output by 'sex' (override with `.groups` argument)
## # A tibble: 4 x 3
## # Groups: sex [2]
## sex high_use count
## <chr> <lgl> <int>
## 1 F FALSE 154
## 2 F TRUE 41
## 3 M FALSE 105
## 4 M TRUE 70
alc %>% group_by(failures, high_use) %>% summarise(count = n())
## `summarise()` regrouping output by 'failures' (override with `.groups` argument)
## # A tibble: 8 x 3
## # Groups: failures [4]
## failures high_use count
## <int> <lgl> <int>
## 1 0 FALSE 238
## 2 0 TRUE 87
## 3 1 FALSE 12
## 4 1 TRUE 12
## 5 2 FALSE 8
## 6 2 TRUE 9
## 7 3 FALSE 1
## 8 3 TRUE 3
alc %>% group_by(famrel, high_use) %>% summarise(count = n())
## `summarise()` regrouping output by 'famrel' (override with `.groups` argument)
## # A tibble: 10 x 3
## # Groups: famrel [5]
## famrel high_use count
## <int> <lgl> <int>
## 1 1 FALSE 6
## 2 1 TRUE 2
## 3 2 FALSE 9
## 4 2 TRUE 9
## 5 3 FALSE 39
## 6 3 TRUE 25
## 7 4 FALSE 128
## 8 4 TRUE 52
## 9 5 FALSE 77
## 10 5 TRUE 23
alc %>% group_by(goout, high_use) %>% summarise(count = n())
## `summarise()` regrouping output by 'goout' (override with `.groups` argument)
## # A tibble: 10 x 3
## # Groups: goout [5]
## goout high_use count
## <int> <lgl> <int>
## 1 1 FALSE 19
## 2 1 TRUE 3
## 3 2 FALSE 82
## 4 2 TRUE 15
## 5 3 FALSE 97
## 6 3 TRUE 23
## 7 4 FALSE 40
## 8 4 TRUE 38
## 9 5 FALSE 21
## 10 5 TRUE 32
Gender - more male students drink more (according to hypothesis) Famrel - More relatives around, less alcohol is consumed (according to hypothesis) goout - When students go out, more alcohol is consumed (according to hypothesis) freetime - more free time, more students drink alcohol (according to hypothesis)
#Graphical results
g1<- ggplot(alc, aes(x = high_use, y = famrel))+ geom_boxplot() + ggtitle("Family relationship vs alcohol consumption")
g1
g2<-ggplot(alc, aes(x = high_use, y = famrel, col = sex))+ geom_boxplot() + ggtitle("Family relationship vs alcohol consumption according to gender")
g2
g3<- ggplot(alc, aes(x = high_use, y = goout))+ geom_boxplot() + ggtitle("Going out with friends vs alcohol consumption")
g3
g4<-ggplot(alc, aes(x = high_use, y = goout, col = sex))+ geom_boxplot() + ggtitle("Going out with friends vs alcohol consumption according to gender")
g4
g5<- ggplot(alc, aes(x = high_use, y = freetime))+ geom_boxplot() + ggtitle("Freetime vs alcohol consumption")
g5
g6<-ggplot(alc, aes(x = high_use, y = freetime, col = sex))+ geom_boxplot() + ggtitle("Freetime vs alcohol consumption according to gender")
g6
Logistic regression
m <- glm(high_use ~ goout + famrel + sex+ freetime, data = alc, family = "binomial")
summary(m)
##
## Call:
## glm(formula = high_use ~ goout + famrel + sex + freetime, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6569 -0.7847 -0.5274 0.8571 2.5801
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.4926 0.6994 -3.564 0.000365 ***
## goout 0.7702 0.1267 6.078 1.22e-09 ***
## famrel -0.4483 0.1414 -3.171 0.001519 **
## sexM 0.9558 0.2599 3.677 0.000236 ***
## freetime 0.1120 0.1400 0.800 0.423896
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 452.04 on 369 degrees of freedom
## Residual deviance: 378.47 on 365 degrees of freedom
## AIC: 388.47
##
## Number of Fisher Scoring iterations: 4
We see here a linear model fitted for our data.Linear regression model is a simple statistical model.It is an approach for modelling the relationship between a dependent variable y and one or more exploratory variables x.The model is found by minimising the sum of the residuals.Residuals are essentially the difference between the actual observed response values and the response values that the model predicted.
The high alcohol use is the target variable and goout + famrel + sex+ freetime is the explanatory variable. The variables are evenly distributed. The summary of the variables in the data is also shown.The Residuals section of the model output breaks it down into 5 summary points - min, 1Q, median, 2Q and max. The coefficients gives estimates for the parameters of the model.In this case the p value for goout, famrel, and sex is very low. So there is a statistical relationship between going out with friends, family relationships and sex with high alcohol use.Free time has p value above 0.05 so there is no statistical relationship between high alcohol use and free time.
#Present and interpret the coefficients of the model as odds ratios and provide confidence intervals for them. Interpret the results and compare them to your previously stated hypothesis.
coef(m)
## (Intercept) goout famrel sexM freetime
## -2.4926487 0.7701903 -0.4483457 0.9557531 0.1119551
OR <- coef(m) %>% exp
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.08269065 0.02016869 0.3154115
## goout 2.16017719 1.69707698 2.7921949
## famrel 0.63868384 0.48200915 0.8405944
## sexM 2.60062833 1.57104372 4.3613169
## freetime 1.11846266 0.85030850 1.4743895
Odd ratios s a statistic that quantifies the strength of the association between two events. If it is greater than 1 we have a positive association and if the odd ratio is smaller than 1 its a negative association. Going out with friends, sex and free time have a positve association with high alcohol use an family relation has negative association. The results are according to the stated hypothesis. The confidence intervals are widest for the the sex variable, so its effect is the most uncertain
Predictive model Predictive power of the final logistic regression model is calculated without the statistically insignificant variable - freetime
m <- glm(high_use ~ goout + famrel + sex, data = alc, family = "binomial")
# predict() the probability of high_use
probabilities <- predict(m, type = "response")
# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)
# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)
select(alc, goout, famrel, sex, high_use, probability, prediction) %>% tail(10)
## goout famrel sex high_use probability prediction
## 361 3 5 M TRUE 0.25406225 FALSE
## 362 3 4 M TRUE 0.34478448 FALSE
## 363 1 4 M TRUE 0.09640288 FALSE
## 364 4 3 M TRUE 0.64356587 TRUE
## 365 2 3 M FALSE 0.26797363 FALSE
## 366 3 4 M TRUE 0.34478448 FALSE
## 367 2 4 M FALSE 0.19155369 FALSE
## 368 4 4 M TRUE 0.53888551 TRUE
## 369 4 5 M TRUE 0.43065938 FALSE
## 370 2 4 M FALSE 0.19155369 FALSE
# creating a confusion matrix with actual values
table(high_use = alc$high_use, prediction = alc$prediction)
## prediction
## high_use FALSE TRUE
## FALSE 241 18
## TRUE 62 49
# creating a confusion matrix with predicted values
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.65135135 0.04864865 0.70000000
## TRUE 0.16756757 0.13243243 0.30000000
## Sum 0.81891892 0.18108108 1.00000000
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))+ geom_point() + ggtitle("logistic regression model")
g
241 students dont consume high levels of alcohol. 18 are predicted wrongly (dont drink alcohol but is predicted to be drinking high levels of alcohol). 49 students drink alcohol and prediction is correct but 62 students drink alcohol but it is predicted that they dont.
Training error
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2162162
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = nrow(alc)) # K-fold cross-validation, cv=training data
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2216216
Training error can be calculated by adding false positives and false negatives. This is further confirmed by loss function. Here, we can see the total proportion of inaccurately classified individuals. The number is about 22% and is not very high. So our model is performing good but can be improved.
Bonus question
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
# compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2162162
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10) # K-fold cross-validation, cv=training data
# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2351351
This model has a better test set performance (0.22) compared to model in Datacample(0.26). 10-fold cross-validation gives good estimate of the actual predictive power of the model. Low value = good
In this exercise we use Boston data from MASS-library. This dataset contains information collected by the U.S Census Service concerning housing in the area of Boston Mass. Data includes 14 variables and 506 rows
# access the MASS package and load other libraries for later analysis
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(corrplot)
## corrplot 0.84 loaded
library(dplyr)
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
# load the data
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
#Show a graphical overview of the data and show summaries of the variables in the data. Describe and interpret the outputs, commenting on the distributions of the variables and the relationships between them
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08205 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
pairs(Boston)
There are some very interesting distributions fo variables in the plot matrix. Variable rad has high and low values so the plot shows that the values are concentrated on either side of the plot.
#Correlation matrix#
#calculating the correlation matrix and correlation plot
cor_matrix <- round(cor(Boston),digits = 2)
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)
Plotted correlation matrix shows that there is some high correlation between variables: * Correlation is quite clear between industrial areas (indus) and nitrogen oxides (nox). Industry adds pollution in the area. Industry seems to correlate also with variablrs like age, dis, ras and tax. * Nitrogen oxides (nox) correlations are very similar with industry (indus) * Crime rate (crim) seems to correlate with good accessibilitty to radial highways (rad) and value property (tax). * Old houses (age) and employment centers have also something common
#Scaled data# All the variables are numerical so we can use scale()-function to scale whole data set
#Standardize the dataset and print out summaries of the scaled data. How did the variables change?
boston_scaled <- scale(Boston)
summary(boston_scaled)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
class(boston_scaled)
## [1] "matrix" "array"
boston_scaled <- as.data.frame(boston_scaled)
Scaling the data makes variables look as if they are in the same range. Variables like black and tax were before scaling hundred fold compared to some other variables
#Create a categorical variable of the crime rate in the Boston dataset (from the scaled crime rate). Use the quantiles as the break points in the categorical variable. Drop the old crime rate variable from the dataset.
#save the scaled crim as scaled_crim
scaled_crim <- boston_scaled$crim
#create a quantile vector of crim and print it
bins <- quantile(scaled_crim)
bins
## 0% 25% 50% 75% 100%
## -0.419366929 -0.410563278 -0.390280295 0.007389247 9.924109610
#create a categorical variable 'crime'
crime <- cut(scaled_crim, breaks = bins, include.lowest = TRUE, label = c("low", "med_low", "med_high", "high"))
#look at the table of the new factor crime, do not change the actual variable "crime"
crime_tab <-table(crime)
crime_tab
## crime
## low med_low med_high high
## 127 126 126 127
#remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)
# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)
summary(boston_scaled)
## zn indus chas nox
## Min. :-0.48724 Min. :-1.5563 Min. :-0.2723 Min. :-1.4644
## 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723 1st Qu.:-0.9121
## Median :-0.48724 Median :-0.2109 Median :-0.2723 Median :-0.1441
## Mean : 0.00000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723 3rd Qu.: 0.5981
## Max. : 3.80047 Max. : 2.4202 Max. : 3.6648 Max. : 2.7296
## rm age dis rad
## Min. :-3.8764 Min. :-2.3331 Min. :-1.2658 Min. :-0.9819
## 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049 1st Qu.:-0.6373
## Median :-0.1084 Median : 0.3171 Median :-0.2790 Median :-0.5225
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617 3rd Qu.: 1.6596
## Max. : 3.5515 Max. : 1.1164 Max. : 3.9566 Max. : 1.6596
## tax ptratio black lstat
## Min. :-1.3127 Min. :-2.7047 Min. :-3.9033 Min. :-1.5296
## 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049 1st Qu.:-0.7986
## Median :-0.4642 Median : 0.2746 Median : 0.3808 Median :-0.1811
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332 3rd Qu.: 0.6024
## Max. : 1.7964 Max. : 1.6372 Max. : 0.4406 Max. : 3.5453
## medv crime
## Min. :-1.9063 low :127
## 1st Qu.:-0.5989 med_low :126
## Median :-0.1449 med_high:126
## Mean : 0.0000 high :127
## 3rd Qu.: 0.2683
## Max. : 2.9865
#Train and test set# Training set contains 80% of the data. 20% is in the test set
#Divide the dataset to train and test sets, so that 80% of the data belongs to the train set
# number of rows in the Boston dataset
n <- nrow(boston_scaled)
# choose randomly 80% of the rows
ind <- sample(n, size = n * 0.8)
ind
## [1] 189 104 403 453 134 448 475 379 423 296 185 65 154 352 64 246 490 73
## [19] 473 159 302 63 255 209 165 8 74 345 320 182 419 322 223 156 455 288
## [37] 427 421 214 369 250 395 160 153 300 442 12 211 60 479 99 29 71 291
## [55] 360 88 35 164 97 181 241 233 335 55 21 271 264 217 87 113 155 413
## [73] 429 408 258 315 391 94 452 66 277 274 265 215 309 3 20 334 424 487
## [91] 366 27 136 458 34 422 18 180 170 105 194 61 261 440 299 117 219 19
## [109] 70 239 504 46 494 284 451 118 464 5 262 439 22 347 83 206 370 353
## [127] 25 171 240 390 257 416 188 179 232 441 205 460 147 17 149 15 186 411
## [145] 176 469 409 245 145 84 7 447 476 319 491 40 31 28 406 459 129 471
## [163] 138 38 236 192 228 357 398 287 174 466 275 168 397 254 454 348 201 308
## [181] 230 428 23 497 294 269 26 196 286 92 400 418 495 349 295 272 114 132
## [199] 89 54 417 336 387 229 237 298 216 327 37 43 260 95 346 162 467 425
## [217] 394 125 210 59 207 119 310 472 24 124 140 474 478 195 499 465 107 354
## [235] 102 371 221 289 363 436 183 115 306 173 242 72 414 148 47 383 135 244
## [253] 82 314 58 44 151 281 449 333 341 238 78 9 169 365 381 404 377 368
## [271] 317 42 410 480 304 468 268 364 235 456 90 131 218 415 133 492 76 126
## [289] 243 463 502 331 202 187 450 77 361 405 111 85 412 273 51 110 75 190
## [307] 328 431 344 312 279 338 305 270 56 139 401 378 109 224 142 372 399 128
## [325] 251 146 420 122 39 248 143 443 489 367 222 67 457 137 200 388 330 482
## [343] 144 384 506 350 163 79 30 259 503 80 175 100 355 106 484 32 488 249
## [361] 193 10 501 324 150 393 380 434 462 16 48 96 323 321 303 339 483 177
## [379] 93 373 493 231 112 166 375 337 283 392 234 53 141 6 14 81 247 204
## [397] 13 161 386 11 121 227 359 326
# create train set
train <- boston_scaled[ind,]
# create test set
test <- boston_scaled[-ind,]
#Fitting the Linear Discriminant Analysis# First the linear discriminant analysis (LDA) is fitted to the train set. The new categorical variable crime is the target variable and all the other variables of the dataset are predictor variables. After fitting we draw the LDA biplot with arrows
#Fit the linear discriminant analysis on the train set. Use the categorical crime rate as the target variable and all the other variables in the dataset as predictor variables. Draw the LDA (bi)plot.
#LDA = linear discriminant analysis
lda.fit <- lda(crime ~. , data = train)
#print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
##
## Prior probabilities of groups:
## low med_low med_high high
## 0.2376238 0.2450495 0.2623762 0.2549505
##
## Group means:
## zn indus chas nox rm age
## low 0.9013537 -0.8650714 -0.10828322 -0.8517176 0.42567188 -0.8716644
## med_low -0.1056770 -0.2539862 -0.03371693 -0.5766104 -0.14007951 -0.3728070
## med_high -0.3885419 0.1669443 0.24766530 0.3804698 0.09691199 0.4208289
## high -0.4872402 1.0170891 -0.04298342 1.0449929 -0.43871124 0.8060424
## dis rad tax ptratio black lstat
## low 0.8267243 -0.6732207 -0.7543941 -0.3412866 0.3747636 -0.75622521
## med_low 0.3589258 -0.5387253 -0.4473719 -0.1017024 0.3499318 -0.12719618
## med_high -0.3489345 -0.4314738 -0.3148704 -0.2239220 0.1045227 0.01197604
## high -0.8516330 1.6384176 1.5142626 0.7811136 -0.7229913 0.84898651
## medv
## low 0.53702433
## med_low -0.02674077
## med_high 0.12855003
## high -0.62628309
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## zn 0.08306703 0.77172618 -1.03906861
## indus 0.04748549 -0.09394423 0.34229580
## chas -0.09598624 -0.08548403 0.07740603
## nox 0.37099951 -0.71287138 -1.41062398
## rm -0.12347542 -0.13069335 -0.15168260
## age 0.18329175 -0.44710766 -0.17143956
## dis -0.09559292 -0.31480080 0.22721641
## rad 3.47448075 0.95322317 -0.06120180
## tax -0.05805092 -0.14051149 0.74269590
## ptratio 0.15330851 0.07261385 -0.56944170
## black -0.12832555 0.05037934 0.22558652
## lstat 0.18654097 -0.13651966 0.38431656
## medv 0.19545008 -0.24254472 -0.25590207
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.9518 0.0336 0.0145
#the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
#target classes as numeric
classes <- as.numeric(train$crime)
classes
## [1] 2 2 4 4 3 4 4 4 4 2 2 1 3 1 2 2 2 2 3 3 1 2 1 2 3 2 2 1 3 1 4 2 3 3 4 1 4
## [38] 4 2 4 2 4 3 3 1 4 2 2 2 4 1 3 2 1 4 1 3 3 2 1 2 3 1 1 3 3 3 1 1 2 3 4 4 4
## [75] 3 3 4 1 4 1 2 2 3 3 3 1 3 1 4 4 4 3 3 4 3 4 3 1 3 2 1 2 3 4 1 2 2 3 2 2 1
## [112] 2 2 1 4 2 4 1 3 4 3 1 1 2 4 1 3 3 2 4 1 4 1 1 3 4 1 4 3 3 3 3 1 4 1 4 4 2
## [149] 3 1 2 4 4 3 2 1 3 3 4 4 3 4 3 1 3 1 3 4 4 1 2 3 1 3 4 3 4 1 1 1 3 4 3 3 2
## [186] 3 3 1 1 1 4 4 3 1 1 2 2 3 1 1 4 1 4 3 3 2 2 3 2 2 3 1 1 3 4 4 4 2 3 2 2 2
## [223] 3 4 3 2 3 4 4 1 2 4 2 1 2 4 3 1 4 4 2 2 1 2 2 2 4 3 2 4 3 2 1 3 1 2 3 1 4
## [260] 1 1 3 2 2 3 3 4 4 4 4 3 2 4 4 2 4 3 4 3 4 1 3 1 4 3 2 2 2 2 4 1 1 1 1 4 2
## [297] 4 4 2 1 4 2 2 3 1 2 2 4 1 3 1 1 1 2 1 2 4 4 2 3 3 4 4 3 2 3 4 1 2 2 3 4 2
## [334] 4 3 1 4 3 1 4 1 4 4 4 1 1 3 1 3 3 1 2 2 1 1 2 3 3 4 2 2 2 2 3 3 4 4 4 4 3
## [371] 2 2 3 2 2 1 4 1 1 4 2 3 2 3 4 1 1 4 3 1 3 1 3 1 3 1 2 3 4 2 1 3 4 2
#plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 3)
#Predicting the classes#
#Save the crime categories from the test set and then remove the categorical crime variable from the test dataset. Then predict the classes with the LDA model on the test data. Cross tabulate the results with the crime categories from the test set. Comment on the results
#save the correct classes from test data
correct_classes <- test$crime
#remove the crime variable from test data
test <- dplyr::select(test, -crime)
#predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)
#cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 20 11 0 0
## med_low 6 11 10 0
## med_high 0 3 15 2
## high 0 0 0 24
Prediction were quite good. There was some errors in the middle of the range but classes low and especially high were good. Only one correct representative of high class was predicted to med_low class.
#Reload the Boston dataset and standardize the dataset (we did not do this in the Datacamp exercises, but you should scale the variables to get comparable distances). Calculate the distances between the observations. Run k-means algorithm on the dataset. Investigate what is the optimal number of clusters and run the algorithm again. Visualize the clusters (for example with the pairs() or ggpairs() functions, where the clusters are separated with colors) and interpret the results
#Loading and scaling Boston data
scaled_Boston <- scale(Boston)
summary(scaled_Boston)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv
## Min. :-1.5296 Min. :-1.9063
## 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 3.5453 Max. : 2.9865
#calculating the euclidean distance matrix between the observation
dist_eu <- dist(scaled_Boston)
#determining the max number of clusters
cluster_max <- 15
#calculate the total within sum of squares
#K-means might produce different results every time, because it randomly
#assigns the initial cluster centers. The function set.seed() can be used to
#deal with that.
set.seed(123)
twcss <- sapply(1:cluster_max, function(k){kmeans(dist_eu, k)$tot.withinss})
# visualize the results
plot(1:cluster_max, twcss, type='b')
One way to determine the number of clusters is to look how the total of within cluster sum of squares (WCSS) behaves when the number of clusters changes. WCSS was calculated from 1 to 15 clusters. The optimal number of clusters is when the total WCSS drops radically. It seems that in this case optimal number of clusters is two. However we are here to learn so I decided to analyse model with four clusters.
After determining the number of clusters I run the K-means alcorithm again
#k-means clustering
km <-kmeans(dist_eu, centers = 4)
# plot the Boston dataset with clusters
pairs(Boston, col = km$cluster)
It seems that when the data is divided to four clusters there is some clear differences in distriputions of several variables. Crim, zn, indus and blacks are variables were one can distinguish clear patterns between clusters. Some variables (rad & tax) show that sometimes 1 or 2 clusters make a clear distripution but observation of other two clusters are ambigious and there is no clear pattern to be regognised.
#BONUS: LDA using clusters as target classes#
#Perform k-means on the original Boston data with some reasonable number of clusters (> 2). Remember to standardize the dataset. Then perform LDA using the clusters as target classes. Include all the variables in the Boston data in the LDA model. Visualize the results with a biplot (include arrows representing the relationships of the original variables to the LDA solution). Interpret the results. Which variables are the most influencial linear separators for the clusters?
#Loading and scaling Boston data
scaled_Boston <- scale(Boston)
scaled_Boston <- as.data.frame(scaled_Boston)
#colnames(scaled_Boston)
#calculating the euclidean distance matrix between the observation
dist_eu <- dist(scaled_Boston)
#k-means clustering
set.seed(123)
km <-kmeans(dist_eu, centers = 4)
cm <- as.data.frame(km$cluster)
#adding the clusters to the scaled dataset
scaled_Boston <- data.frame(scaled_Boston, clust = cm)
colnames(scaled_Boston)[15] <- "clust"
summary(scaled_Boston)
## crim zn indus chas
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563 Min. :-0.2723
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668 1st Qu.:-0.2723
## Median :-0.390280 Median :-0.48724 Median :-0.2109 Median :-0.2723
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150 3rd Qu.:-0.2723
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202 Max. : 3.6648
## nox rm age dis
## Min. :-1.4644 Min. :-3.8764 Min. :-2.3331 Min. :-1.2658
## 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366 1st Qu.:-0.8049
## Median :-0.1441 Median :-0.1084 Median : 0.3171 Median :-0.2790
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059 3rd Qu.: 0.6617
## Max. : 2.7296 Max. : 3.5515 Max. : 1.1164 Max. : 3.9566
## rad tax ptratio black
## Min. :-0.9819 Min. :-1.3127 Min. :-2.7047 Min. :-3.9033
## 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876 1st Qu.: 0.2049
## Median :-0.5225 Median :-0.4642 Median : 0.2746 Median : 0.3808
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058 3rd Qu.: 0.4332
## Max. : 1.6596 Max. : 1.7964 Max. : 1.6372 Max. : 0.4406
## lstat medv clust
## Min. :-1.5296 Min. :-1.9063 Min. :1.000
## 1st Qu.:-0.7986 1st Qu.:-0.5989 1st Qu.:2.000
## Median :-0.1811 Median :-0.1449 Median :3.000
## Mean : 0.0000 Mean : 0.0000 Mean :2.943
## 3rd Qu.: 0.6024 3rd Qu.: 0.2683 3rd Qu.:4.000
## Max. : 3.5453 Max. : 2.9865 Max. :4.000
#Original Boston dataset is now scaled and the result of K-means clustering is saved to the variable *clust*
#LDA = linear discriminant analysis
lda.fit.km <- lda(clust ~. , data = scaled_Boston)
#print the lda.fit object
lda.fit.km
## Call:
## lda(clust ~ ., data = scaled_Boston)
##
## Prior probabilities of groups:
## 1 2 3 4
## 0.1304348 0.2272727 0.2114625 0.4308300
##
## Group means:
## crim zn indus chas nox rm age
## 1 1.4330759 -0.4872402 1.0689719 0.4435073 1.3439101 -0.7461469 0.8575386
## 2 0.2797949 -0.4872402 1.1892663 -0.2723291 0.8998296 -0.2770011 0.7716696
## 3 -0.3912182 1.2671159 -0.8754697 0.5739635 -0.7359091 0.9938426 -0.6949417
## 4 -0.3894453 -0.2173896 -0.5212959 -0.2723291 -0.5203495 -0.1157814 -0.3256000
## dis rad tax ptratio black lstat
## 1 -0.9620552 1.2941816 1.2970210 0.42015742 -1.65562038 1.1930953
## 2 -0.7723199 0.9006160 1.0311612 0.60093343 -0.01717546 0.6116223
## 3 0.7751031 -0.5965444 -0.6369476 -0.96586616 0.34190729 -0.8200275
## 4 0.3182404 -0.5741127 -0.6240070 0.02986213 0.34248644 -0.2813666
## medv
## 1 -0.81904111
## 2 -0.54636549
## 3 1.11919598
## 4 -0.01314324
##
## Coefficients of linear discriminants:
## LD1 LD2 LD3
## crim -0.18113078 0.5012256 0.60535205
## zn -0.43297497 1.0486194 -0.67406151
## indus -1.37753200 -0.3016928 -1.07034034
## chas 0.04307937 0.7598229 0.22448239
## nox -1.04674638 0.3861005 0.33268952
## rm 0.14912869 0.1510367 -0.67942589
## age 0.09897424 -0.0523110 -0.26285587
## dis -0.13139210 0.1593367 0.03487882
## rad -0.65824136 -0.5189795 -0.48145070
## tax -0.28903561 0.5773959 -0.10350513
## ptratio -0.22236843 -0.1668597 0.09181715
## black 0.42730704 -0.5843973 -0.89869354
## lstat -0.24320629 0.6197780 0.01119242
## medv -0.21961575 0.9485829 0.17065360
##
## Proportion of trace:
## LD1 LD2 LD3
## 0.7596 0.1768 0.0636
#the function for lda biplot arrows
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
#target classes as numeric
classes <- as.numeric(scaled_Boston$clust)
#classes
#plot the lda results
plot(lda.fit.km, dimen = 2, col = classes, pch = classes)
lda.arrows(lda.fit.km, myscale = 3)
#Super-bonus# 3D plot where observations color is the crime classes of the train set
model_predictors <- dplyr::select(train, -crime)
#check the dimensions
#dim(model_predictors)
#dim(lda.fit$scaling)
#matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit$scaling
matrix_product <- as.data.frame(matrix_product)
#3d plot
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = train$crime)
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
3D plot where observations color is based on the K-means clusters
model_predictors <- dplyr::select(scaled_Boston, -clust)
#check the dimensions
#dim(model_predictors)
#dim(lda.fit.km$scaling)
#matrix multiplication
matrix_product <- as.matrix(model_predictors) %*% lda.fit.km$scaling
matrix_product <- as.data.frame(matrix_product)
#3D plot
plot_ly(x = matrix_product$LD1, y = matrix_product$LD2, z = matrix_product$LD3, type= 'scatter3d', mode='markers', color = scaled_Boston$clust)
Colors of the both plots is based to four classes. It seems that K-means plot shows the different clusters more clearly than the plot that is based on the crime classification.